Figure 1. Illustration of the emergence of Slash-Dominant Heads (SDHs). Attention scores are determined by pre-PE queries, keys, and RoPE (left bottom). Because token embeddings lie approximately on a cone, queries/keys are almost rank-one, and nearly identical across tokens (left top), so RoPE primarily governs variation of attention scores across tokens. Then RoPE's high- and medium-frequency components interact constructively at specific lags, producing the attention score peaks at offset $\Delta$ (right top). As a result, SDHs emerge and are generalizable (right bottom).

Authors: **Yuan Cheng\*, Fengzhuo Zhang\*, Yunlong Hou\*, Cunxiao Du, Chao Du, Tianyu Pang, Aixin Sun, Zhuoran Yang**

*Co-First Authors

Demystifying the Slash Pattern in Attention: The Role of RoPE

💡 TL;DR:

We investigate a key mechanism for information propagation in LLMs—slash-dominant heads (SDHs)—which exhibit a distinctive slash attention pattern, and we demystify how these heads emerge.

Citation

@online{cheng2026demystifyingslashpatternattention,
  title = {Demystifying the Slash Pattern in Attention: The Role of RoPE},
  author = {Cheng, Yuan and Zhang, Fengzhuo and Hou, Yunlong and Du, Cunxiao and Du, Chao and Pang, Tianyu and Sun, Aixin and Yang, Zhuoran},
  year = {2026},
  url = {https://arxiv.org/abs/2601.08297}
}

1. Concepts: Information Passing and Attention Patterns in LLMs

Given a prompt that contains a question, an LLM can generate a coherent and contextually appropriate answer. A crucial ingredient behind this ability is the model's capability to pass information across different tokens in the sequence, most notably from the prompt tokens to the answer tokens. In modern LLMs, the model's information-passing behavior is closely linked to a specific structural pattern: **the slash pattern** in its attention scores.

The slash pattern refers to attention scores concentrating along the $\Delta$-th sub-diagonal of the attention score matrix, forming a slash line (Figure 2). We refer to attention heads exhibiting slash patterns as Slash-Dominant Heads (SDHs), which are formally defined as follows.

Definition 1: $(\kappa, \Delta)$-Slash-Dominance

Intuitively, the average slash score measures the average attention paid to tokens that are $\Delta$ positions before the current token.
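Here is a minimal NumPy sketch of one plausible reading of Definition 1: assume the average slash score at offset $\Delta$ is the mean attention weight on the $\Delta$-th sub-diagonal, and that a head counts as $(\kappa, \Delta)$-slash-dominant when this score is at least $\kappa$. The function names and the exact thresholding rule are our own assumptions rather than the blog's formal definition.

```python
import numpy as np

def average_slash_score(attn: np.ndarray, delta: int) -> float:
    """Mean attention weight on the delta-th sub-diagonal of a (T, T)
    attention score matrix (rows = query positions, columns = key positions)."""
    T = attn.shape[0]
    if delta >= T:
        return 0.0
    # Collect the entries attn[i, i - delta] for i = delta, ..., T - 1.
    return float(np.diagonal(attn, offset=-delta).mean())

def is_slash_dominant(attn: np.ndarray, kappa: float, delta: int) -> bool:
    """One plausible reading of (kappa, delta)-slash-dominance: the average
    slash score at offset delta is at least kappa."""
    return average_slash_score(attn, delta) >= kappa
```

For instance, `is_slash_dominant(attn, kappa=0.5, delta=1)` would flag an induction-style head that, on average, places at least half of its attention on the immediately preceding token (0.5 is only an illustrative threshold).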

SDHs and their slash patterns play important algorithmic roles in LLMs. For example, they enable In-context Learning (ICL) via the induction head circuit, which is a special case of an SDH with $\Delta = 1$. In addition, another line of work, including XAttention and MTraining, leverages slash patterns to accelerate long-context inference and training.

Figure 2. Average of attention score matrices in Qwen2.5-7B-Instruct with prompts from LongBench. We denote the a-th head at the b-th layer as LbHa in this blog. In panels (a)–(c), attention concentrates on sub-diagonals with small offsets 0, 1, and 2, respectively. In panels (d)–(f), it concentrates on sub-diagonals with large offsets exceeding 500.

As shown in Figure 2, SDHs with diverse values of $\Delta$ are prevalent in modern open-source LLMs (e.g., Qwen). These SDHs enable a token at position $i$ to attend directly to the token at position $i-\Delta$, thereby passing information from earlier tokens to later ones. Their widespread presence and functional importance naturally motivate our central research question:

How do pretrained LLMs implement SDHs using their transformer architectures?

To answer this question, we first need to know what determines the attention scores in LLMs.
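Before turning to that question, here is a toy NumPy sketch (an idealized head, not the model's actual computation) of the information-passing role described above: when a head places all of the attention of token $i$ on token $i-\Delta$, its output at position $i$ is exactly the value vector from position $i-\Delta$, so the head copies information forward by $\Delta$ positions.

```python
import numpy as np

def idealized_sdh_output(V: np.ndarray, delta: int) -> np.ndarray:
    """Output of a perfectly slash-dominant head: each token i attends only
    to token i - delta (clamped to token 0 for the first delta tokens)."""
    T, _ = V.shape
    A = np.zeros((T, T))                 # idealized attention score matrix
    for i in range(T):
        A[i, max(i - delta, 0)] = 1.0    # one-hot slash pattern at offset delta
    return A @ V                         # row i of the output is V[i - delta]

V = np.random.randn(8, 4)                # toy value vectors for 8 tokens
out = idealized_sdh_output(V, delta=2)
assert np.allclose(out[2:], V[:-2])      # information copied forward by 2 positions
```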


2. Background: What Determines the Attention Scores?

In this section, we introduce the backbone of modern transformers: the causal self-attention layer with RoPE. From this formulation, we can directly see what determines the attention scores.
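As a quick preview (using our own notation, which may differ slightly from what follows in the blog), the standard RoPE formulation rotates the pre-PE query $q_m$ at position $m$ and the pre-PE key $k_n$ at position $n$ before the dot product, so the pre-softmax attention score becomes

$$
\langle R_{\Theta, m}\, q_m,\; R_{\Theta, n}\, k_n \rangle \;=\; q_m^\top R_{\Theta,\, n-m}\, k_n,
$$

where $R_{\Theta, t}$ is a block-diagonal matrix of $2\times 2$ rotations with angles $t\theta_j$ and frequencies $\theta_j = b^{-2j/d}$ for $j = 0, \dots, d/2 - 1$ (rotary base $b$, head dimension $d$). Each attention score is therefore determined by the pre-PE query, the pre-PE key, and the relative offset $m-n$ through RoPE, which is exactly the decomposition illustrated in Figure 1.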